This paper presents a discussion of the application of data-flow
machine concepts to the design and implementation of database
machines which execute relational algebra queries. We analyze the
performance of multiprocessor nested-loops and sort-merge join
algorithms and show that the nested-loops algorithm is generally
superior. Three levels of operand granularity for data-flow database
machines are introduced and compared using the nested-loops join
algorithm. We demonstrate that relation-level granularity is too
coarse and that tuple-level granularity is too fine. The third level
of granularity, a page of a relation, is shown to be the best choice
from both hardware and software viewpoints. Finally, a preliminary
design for a data-flow database machine which utilizes page-level
granularity and supports distributed control of instruction execution
is presented.

SIGNIFICANCE AND EXPLANATION

A preliminary architecture of a database machine is presented. This
architecture uses the presence of data as the criterion for
instruction initiation and control rather than the instruction's
position in the program. Such a machine is known as a data-flow
machine. The use of data-flow techniques in database machines has
been shown to be promising elsewhere; however, to this date no
architecture using these techniques has been designed. The
architecture presented is only a preliminary one and will most likely
undergo a number of future changes.

DESIGN CONSIDERATION FOR DATA-FLOW DATABASE MACHINES
Haran Boral and David J. DeWitt

1. Introduction

During the past several years we have been investigating the design
and implementation of multiprocessor database machines for the
execution of relational algebra queries. In [1,2] the architecture of
DIRECT, a MIMD database machine, is described. The problem of
relation fragmentation and its impact on query execution time is
discussed in [1]. In [4], four processor assignment strategies for
MIMD database machines are described and evaluated. One of the
primary results presented in [4] is that the application of data-flow
machine techniques to the processing of relational algebra queries
significantly enhances system performance. The architecture of DIRECT
[1,2] is that of a data-flow machine where all the control functions
are centralized. In this paper we present an approach to constructing
a data-flow database machine which supports distributed control.

In Section 2.0, we introduce the basic concepts of query processing
using multiple processors and data-flow machines. We analyze the
performance of multiprocessor nested-loops and sort-merge join
algorithms and show that the nested-loops algorithm is generally
superior. In Section 3.0, three levels of operand granularity for
data-flow database machines are introduced and compared. We
demonstrate that relation-level granularity is too coarse and that
tuple-level granularity is too fine. The third level of granularity,
a page of a relation, is shown to be the best choice from both
hardware and software viewpoints. Section 4.0 contains a preliminary
design for a data-flow database machine which supports page-level
granularity. Our conclusions and areas of future research are
presented in Section 5.0.

Sponsored by the United States Army under Contract No. DAAG29-75-C-0024
and No. DAAG29-79-C-0165 and the National Science Foundation under
Grant MCS78-01721.

2.  Background 

2.1. Relational Query Processing

Each relational algebra query is generally comprised of one or more
relational algebra operations (instructions) and is organized in the
form of a tree. Each node represents an operation to be performed on
a number of relations. Some examples are restrict, join, append, and
delete. Nodes higher up in the tree operate on relations produced by
nodes below them. Figure 2.1 contains an example of a typical
relational algebra query in the form of a query tree.

2.1.1. Parallel Join Algorithms

In a relational database system one of the most time consuming
operations that must be performed is the join operator. In [5],
several alternative join algorithms for uniprocessor systems are
presented and analyzed. The results show that in the absence of
indices (as in DIRECT) a sort-merge algorithm performs best. However,
for multiprocessor systems we feel that the performance of the
nested-loops algorithm is superior to that of the sort-merge
algorithm. To support this claim we present some intuitive arguments
followed by a short, informal analysis of both algorithms.

The multiprocessor sort-merge algorithm employs a parallel sort of
both relations on the joining attribute. This is followed by a
uniprocessor merge on the joining attribute to perform the join. Our
parallel sort uses a binary tree arrangement of processors as in [6].

Figure 2.1. A Sample Query Tree (R: Restrict, J: Join)

The multiprocessor nested-loops algorithm works by joining each unit
of one (the outer) relation with all of the units in the other (the
inner) relation. If a unit corresponds to a page, the outer relation
is n pages long, and there are n processors available, then each
processor can join one page of the outer relation with the entire
inner relation. A unit can also be a tuple.
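
The page-level form of this algorithm can be sketched in Python as
follows. This is only an illustrative sketch under stated assumptions:
the p processors are simulated by a sequential loop, the assignment of
outer pages to processors (every p-th page) is our own choice, and two
pages are joined by comparing every pair of tuples rather than by the
merge assumed in the analysis. The names join_pages and
nested_loops_join are hypothetical.

```python
from typing import List, Tuple

Row = Tuple          # a tuple of attribute values
Page = List[Row]     # a page is a short list of rows


def join_pages(outer_page: Page, inner_page: Page,
               key_outer: int, key_inner: int) -> List[Row]:
    """Join two pages on one attribute by comparing every pair of rows."""
    return [o + i for o in outer_page for i in inner_page
            if o[key_outer] == i[key_inner]]


def nested_loops_join(outer: List[Page], inner: List[Page], p: int,
                      key_outer: int = 0, key_inner: int = 0) -> List[Row]:
    """Simulate p processors; processor k takes outer pages k, k+p, k+2p, ..."""
    result: List[Row] = []
    for k in range(p):                    # one iteration per simulated processor
        for outer_page in outer[k::p]:    # this processor's share of the outer relation
            for inner_page in inner:      # every page of the inner relation
                result.extend(
                    join_pages(outer_page, inner_page, key_outer, key_inner))
    return result


outer = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")]]
inner = [[(2, "x"), (3, "y")], [(4, "z"), (5, "w")]]
print(nested_loops_join(outer, inner, p=2))
```

Each simulated processor reads one outer page followed by all inner
pages, which is the access pattern the timing analysis below counts.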

We assume that each page of both relations is sorted and that a merge
algorithm is used by the multiprocessor sort-merge algorithm to sort
two pages and by the multiprocessor nested-loops algorithm to join
two pages. Therefore, the time required for a processor to process
two input pages is the same for both algorithms and can be
disregarded in the following comparison of the two algorithms.
Finally, for the sorting analysis we have made a number of
simplifying assumptions. The most significant is that after each
stage all the processors flush their buffers. Although optimizations
will improve the total execution time, the improvement will not be
significant. Disregarding optimizations makes the analysis easier.

Intuitively the nested-loops algorithm should outperform the
sort-merge since the amount of parallelism that can be attained is
high (limited only by the number of pages in the outer relation) and
can be maintained throughout the duration of the execution. When
sorting in parallel one may be able to start with a large number of
processors, but after each stage the number of processors decreases
whereas the amount of data examined by each processor increases,
until in the final stage one processor must examine the relation in
its entirety.

A second consideration that is not immediately apparent is that of
cache memory usage. In order for the parallel sort to execute
optimally the complete relation must be kept in the cache for the
duration of the sort. In the case of a nested-loops join, only a
portion of the inner relation must be kept in the cache for the
duration of the operation (since once a page of the inner relation
has been seen by all the processors that page frame can be used for
another page of that relation).

The following informal analysis verifies our intuitive arguments
about the relative performance of the two algorithms. We assume that
the two relations to be joined contain n and m pages and that n > m.
We also assume that there are p processors, $1 \le p \le n$,
available to perform the join and each processor has a 3 page
internal buffer: 2 for input and 1 for output. To simplify the
analysis we have also assumed that p, m, n are all powers of 2.

For the multiprocessor nested-loops algorithm, the execution time of
the join is $t_{\text{nested-loops}} = \frac{n}{p}(1+m)$. Each of the
p processors will join n/p pages of the outer relation with all m
pages of the inner relation. Thus each of the processors will read 1
page of the outer relation followed by m pages of the inner relation.
This will occur n/p times.

For the multiprocessor sort-merge algorithm, the execution time of
the join is

$$t_{\text{sort-merge}} = t_{\text{sort}}(\text{outer}) + t_{\text{sort}}(\text{inner}) + t_{\text{merge}},$$

where $t_{\text{merge}} = n + m$. The sort time for each relation,
$t_{\text{sort}}$, equals $t_{\text{suboptimal}} + t_{\text{optimal}}$,
where

$$t_{\text{suboptimal}} = \frac{2n}{p}\left[\log_2\frac{n}{2p} + 1\right] \qquad (*)$$

$$t_{\text{optimal}} = 4n - \frac{2n}{p} \qquad (**)$$

The derivation of (*) follows. At each suboptimal level the p
processors do the same amount of work: 2n/p I/O operations (n/p reads
and n/p writes). At level i, i>0, each processor sees 2 "runs" whose
length is $2^{i-1}$ pages and produces one run whose length is $2^i$
pages. The first optimal level is that level whose input runs are
each of length n/2p (thus each of the p processors inputs exactly n/p
pages). This is level number $\log_2(n/2p)+1$. Since we start
counting levels from 0 there are $\log_2(n/2p)+1$ suboptimal levels.

To derive (**) we note that in each level we do twice as much work as
in the preceding level. We start with n/p reads and n/p writes by
each processor. There are $\log_2(2p)$ optimal levels. Thus
$t_{\text{optimal}}$ is the sum of the $\log_2(2p)$ terms 2n/p, 4n/p,
..., 2n. This sum simplifies to $4n - 2n/p$.
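
The simplification can be checked by treating the terms as a
geometric series with ratio 2. Writing $K = \log_2(2p)$ for the
number of optimal levels (the index k is introduced here only for
this check), the k-th term is $2^k n/p$:

```latex
t_{\text{optimal}}
  = \sum_{k=1}^{K} \frac{2^{k} n}{p}
  = \frac{n}{p}\left(2^{K+1} - 2\right)
  = \frac{n}{p}\,(4p - 2)
  = 4n - \frac{2n}{p},
  \qquad K = \log_2(2p).
```

The first term (k = 1) is 2n/p and the last (k = K) is 2n, matching
the sequence given above.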

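Our two final formulas are then:

$$t_{\text{nested-loops}} = \frac{n}{p}(1+m),$$

and

$$\begin{aligned}
t_{\text{sort-merge}} &= \frac{2n}{p}\left[\log_2\frac{n}{2p}+1\right] + 4n - \frac{2n}{p} \\
&\quad + \frac{2m}{p}\left[\log_2\frac{m}{2p}+1\right] + 4m - \frac{2m}{p} \\
&\quad + n + m \\
&= \frac{2n}{p}\log_2\frac{n}{2p} + \frac{2m}{p}\log_2\frac{m}{2p} + 5n + 5m.
\end{aligned}$$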

It is clear that as p approaches n, the nested-loops algorithm
outperforms the sort-merge. For very small values of p (relative to
n) the opposite is true. This is as expected since the uniprocessor
sort-merge algorithm has n log n complexity while the nested-loops
algorithm has $n^2$ complexity. Figures 2.2-2.4 present the behavior
of the two algorithms for three different joins. We see that the
nested-loops algorithm generally outperforms the sort-merge algorithm
even when the number of processors executing the join is only a
fairly small fraction of the "optimal" number of processors for the
nested-loops algorithm (i.e. the number of pages in the larger/outer
relation). It should also be noted that at optimal levels the
nested-loops algorithm outperforms the sort-merge according to the
ratio (1+m)/(5*(n+m)).
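
The two cost formulas are easy to evaluate numerically. The sketch
below is our own illustration, not part of the original analysis: it
encodes the formulas as given and searches the powers of 2 for the
smallest p at which nested-loops becomes cheaper. The function names
and the sample sizes (n = 128, m = 64) are assumptions chosen for the
example. Note that for p > m/2 the logarithm term for the inner
relation becomes negative, so the sort-merge formula should be read
as approximate in that range.

```python
from math import log2


def t_nested_loops(n: int, m: int, p: int) -> float:
    """Page reads per processor: n/p outer pages, each followed by m inner pages."""
    return n / p * (1 + m)


def t_sort(n: int, p: int) -> float:
    """Parallel sort time: suboptimal levels (*) plus optimal levels (**)."""
    suboptimal = 2 * n / p * (log2(n / (2 * p)) + 1)
    optimal = 4 * n - 2 * n / p
    return suboptimal + optimal


def t_sort_merge(n: int, m: int, p: int) -> float:
    """Sort both relations in parallel, then merge on a single processor."""
    return t_sort(n, p) + t_sort(m, p) + (n + m)


def crossover(n: int, m: int) -> int:
    """Smallest power of 2, p <= n, at which nested-loops is the cheaper join."""
    p = 1
    while p <= n:
        if t_nested_loops(n, m, p) < t_sort_merge(n, m, p):
            return p
        p *= 2
    raise ValueError("nested-loops never wins for p <= n")


n, m = 128, 64
for p in (1, 2, 4, 8, 128):
    print(p, t_nested_loops(n, m, p), t_sort_merge(n, m, p))
print("crossover at p =", crossover(n, m))
```

For these sample sizes the sort-merge join is cheaper with one
processor, while nested-loops is already cheaper with 8 of the 128
"optimal" processors.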

2.1.2. Parallel Update Operations

The following algorithms are employed by DIRECT for the three update
operators append, delete, and replace. Deletes are implemented as
negated restricts. To perform an append we execute a variation of a
merge. We assume that all pages are sorted on either the key or the
entire tuple. The tuples to be appended are placed in as few pages as
possible in a sorted order. Each processor executing the append is
given a page of the original relation and in turn all the new pages.
The processor examines both pages and upon finding a duplicate tuple
it deletes the tuple from the old page. Finally the new pages are
added to the relation page table. Replace is implemented as a
modified delete followed by an append. In each case the pages of the
result relation remain sorted. Each relation must undergo a periodic
reorganization if pages become too sparse.
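
The append procedure can be sketched as follows. This is a hedged
illustration rather than DIRECT's code: processors are simulated
sequentially, duplicates are found with set membership instead of a
merge of two sorted pages, and duplicates among the new tuples
themselves are removed up front (an assumption the text does not
state). The name append_tuples and the page_size parameter are
hypothetical.

```python
from typing import List, Tuple

Row = Tuple
Page = List[Row]


def append_tuples(relation: List[Page], new_tuples: List[Row],
                  page_size: int) -> List[Page]:
    """Return the page table after appending new_tuples to relation."""
    # Pack the new tuples, in sorted order, into as few pages as possible.
    new_sorted = sorted(set(new_tuples))
    new_pages = [new_sorted[i:i + page_size]
                 for i in range(0, len(new_sorted), page_size)]

    cleaned: List[Page] = []
    for old_page in relation:            # one simulated processor per old page
        for new_page in new_pages:       # it sees every new page in turn
            duplicates = set(new_page)
            # Delete duplicates from the OLD page; order is preserved,
            # so the old page stays sorted.
            old_page = [t for t in old_page if t not in duplicates]
        cleaned.append(old_page)

    # Finally the new pages are added to the page table.
    return cleaned + new_pages


relation = [[(1,), (3,)], [(5,), (7,)]]
print(append_tuples(relation, [(3,), (4,), (8,)], page_size=2))
```

Because tuples are only ever removed from old pages, old pages can
become sparse over time, which is why a periodic reorganization is
needed.
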

2.2. Data-flow Machines

A data-flow machine is an architecture devoid of a program counter
where instructions are enabled for execution as soon as their
operands are present. Such a machine consists of a memory section, a
processing section, and an interconnection device between the two
sections. A memory cell contains an instruction and room for the
operand data. As soon as all the required data is present, the
contents of the cell are sent to some processor for execution. This
frees the cell for the execution of the next instruction. Output from
the processor is sent via the interconnection device to one or more
memory cells, possibly enabling one or more instruction(s) in the
destination cell(s).
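
The firing rule can be illustrated with a toy interpreter. The sketch
below is our own simplification and not a model of any particular
machine: there is a single sequential "processor", operands are
integers, and the Cell class and run function are hypothetical names
introduced for the example.

```python
import operator
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Cell:
    op: Callable[..., int]            # instruction held in the memory cell
    arity: int                        # number of operands the cell waits for
    dests: List[Tuple[str, int]]      # (cell name, operand slot) result targets
    operands: Dict[int, int] = field(default_factory=dict)


def run(cells: Dict[str, Cell],
        tokens: List[Tuple[str, int, int]]) -> List[int]:
    """Deliver (cell, slot, value) tokens; fire a cell once all operands arrive."""
    pending = list(tokens)
    outputs: List[int] = []
    while pending:
        name, slot, value = pending.pop(0)
        cell = cells[name]
        cell.operands[slot] = value
        if len(cell.operands) == cell.arity:       # all operands are present
            args = [cell.operands[i] for i in range(cell.arity)]
            cell.operands.clear()                  # firing frees the cell
            result = cell.op(*args)
            if cell.dests:                         # result enables later cells
                pending.extend((d, s, result) for d, s in cell.dests)
            else:                                  # no destination: final output
                outputs.append(result)
    return outputs


# (2 + 3) * 4 expressed as two cells; no program counter orders them.
cells = {
    "add": Cell(operator.add, 2, [("mul", 0)]),
    "mul": Cell(operator.mul, 2, []),
}
print(run(cells, [("add", 0, 2), ("add", 1, 3), ("mul", 1, 4)]))
```

The multiply cell receives one operand early but does not execute
until the add cell's result arrives, which is the data-driven
behavior described above.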

Various architectures for data-flow machines have been proposed
[7-11]. These architectures differ from each other in many ways. One
difference is the granularity of the operands and the types of
operations that the processors execute. For example, Dennis [7] talks
about assigning such instructions as add and multiply to the
processors whereas Arvind [8] and Rumbaugh [10] assign entire
procedures to processors.

For data-flow database machines there are also several alternative
variable granularities for enabling relational algebra operators in
the query tree. That is, the basic variable used for scheduling
decisions can be a whole relation, a fragment of a relation, or a
single tuple. In Section 3.0, we will describe and then contrast each
of these granularities.

In order to illustrate our ideas we chose to use the MIT machine [7]
as a model since it is easy to understand and describe. Furthermore
we feel that although the model differs significantly from others the
basic results remain unchanged. The machine organization of Figure
2.5 depicts the model described in [7]. Although later, more
sophisticated, variations have been described in the literature [11]
we feel that they do not conceptually differ from the original.

In the machine of Figure 2.5 the interconnection mechanism is divided
into two sections. The arbitration network provides a path from every
memory cell to every processor. Enabled cells travel through it to
processors for execution. Result packets are sent from the processors
through the distribution network to the memory.

2.3. Relational Query Processing in Data-Flow Machines

We assume that the instruction in each memory cell corresponds to a
node in the query tree and that the data is represented by page
tables, pointing to pages either in a mass-storage cache or on mass
storage. Thus a relation can also be thought of as a stream [11] of
pages. In order to simplify our discussion we assume that at the time
that a memory cell fires, the associated data pages are retrieved
from a mass-storage cache and placed, together with the control
information, on the arbitration network. Similarly, the distribution
network places output pages in the mass-storage cache and updates the
page tables in the target cells.

Figure 2.5. The MIT Data-Flow Machine model

The processing of queries in a data-flow fashion is related to the
idea of processing relational queries in a pipelined fashion which
has been previously suggested by Smith and Chang [12] and Yao [13].
There are, however, several important differences between the two
approaches. In the pipelined approach, there will be at most one
processor executing each node in the tree and therefore the
concurrency obtained will be limited by the number of nodes in the
query tree. In the data-flow approach we can have any number of
processors executing each node and can dynamically adjust which
processors are executing which nodes in the query tree in order to
maximize performance. The other major difference is that in the
data-flow approach we never need to wait for one node to completely
finish before initiating the subsequent operator, as has been
suggested is necessary for pipelining [13].

3. Three Operand Granularities for Data-flow Query Processing

3.1. Relation-level Granularity

The coarsest possible granularity for enabling instructions is the
relation. That is, a node in the query tree is enabled for execution
only when its source operands have been completely computed. Clearly,
if the query is in a tree format, all leaf nodes are immediately
executable. A node higher up in the query tree is enabled whenever
all of its descendants have finished executing.

3.2. Page-level Granularity

In this approach a page of a relation is used for scheduling
decisions. This means that an operator can be initiated as soon as at
least one page of each participating relation exists. Assigning
processors to operate on pages rather than relations makes it
possible to cut down on page traffic between the data-flow machine
memory and the mass storage device(s) by distributing processors
across all nodes of the tree and pipelining pages of intermediate
relations between them.
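
The difference between the two enabling rules can be stated as a pair
of predicates. The sketch below is illustrative only; the Operand
record and the is_enabled function are hypothetical names, and the
bookkeeping a real scheduler would need is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Operand:
    pages_available: int   # pages of the source relation produced so far
    complete: bool         # True once the producing node has finished


def is_enabled(operands: List[Operand], granularity: str) -> bool:
    """Decide whether a query-tree node may start executing."""
    if granularity == "relation":
        # Every source relation must be completely computed.
        return all(o.complete for o in operands)
    if granularity == "page":
        # At least one page of each participating relation must exist.
        return all(o.pages_available >= 1 for o in operands)
    raise ValueError(f"unknown granularity: {granularity}")


# A join whose inputs are still being produced by the nodes below it.
inputs = [Operand(pages_available=3, complete=False),
          Operand(pages_available=1, complete=False)]
print(is_enabled(inputs, "relation"), is_enabled(inputs, "page"))
```

Under page-level granularity the join above can start while its
descendants are still running, which is what allows intermediate
pages to be pipelined up the tree.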

In order to compare data-flow query processing which employs
relation-level granularity with page-level granularity a detailed
simulation of DIRECT was implemented [4]. While this simulation
measures the performance of each data-flow strategy on a
multiprocessor organization [1,2] which is not a true data-flow
machine (i.e. it has centralized control), we feel that similar
results would be obtained if the strategies were tested on a machine
with a more decentralized control organization.

The following assumptions were made:

1) 16K byte operands for instruction packets.

2) LSI-11s as processors (can read a 16K byte page in 33 ms).

3) The data cache is constructed from Intel 2314 CCD chips.

4) Two IBM 3330 disk drives for mass storage of relations.

5) A cross-bar switch with broadcast capabilities is used to connect
the processors with the data cache. The cross-bar switch is a
feature of DIRECT, not of data-flow machines.

In [4] a description of the experiments and results is presented.
Figure 3.1 shows the results of the simulation for a representative
benchmark containing ten queries (2 queries with 1 restrict operator
only, 3 queries with 1 join and 2 restricts each, 2 queries with 2
joins and 3 restricts each, 1 query with 3 joins and 4 restricts, 1
query with 4 joins and 4 restricts, and 1 query with 5 joins and
[illegible] restricts), a relational database containing 15 relations
with a combined size of 5.5 megabytes, and two CCD cache page frames
for each processor. As illustrated by this experiment, page-level
granularity outperforms relation-level granularity by a factor of
about two to one. (The interested reader is referred to [4] for
results for different query mixes.) These results seem to verify the
benefits of pipelining pages of relations up the query tree in order
to minimize movement of data between a shared data cache and
secondary memory.

3.3. Tuple-level Granularity

In this approach a tuple of a relation is the basic unit which is
used for scheduling decisions. This means that an operator can be
initiated as soon as at least one tuple of each participating
relation exists. As with page-level granularity, this granularity
also offers the possibility of pipelining tuples of intermediate
relations between nodes in the query tree. However, this granularity
places unnecessarily high bandwidth requirements on the arbitration
network as will be demonstrated below.

When the nested-loops join algorithm is applied with tuple-level
granularity, each tuple of the outer relation will be joined with
every tuple of the inner relation. Let the outer relation be A and
the inner relation be B. Assume that the number of tuples in A is n
and the number of tuples in B is m. Furthermore, assume that each
tuple in A and B is 100 bytes long and that c represents the number
of overhead bytes associated with each instruction that passes
through the arbitration network. To execute the join, n*m*(200+c)
bytes will have to pass from the memory through the arbitration
network to the processing section.

Next consider the bandwidth requirements if this same example is
executed using page-level granularity. Assume that each page is 1000
bytes long. Therefore, relation A occupies n/10 pages and relation B
occupies m/10 pages, and each of the (n/10)*(m/10) instruction
packets carries two pages plus c overhead bytes. Thus
n*m*(20 + c/100) bytes must pass through the arbitration network.
Even if one ignores the overhead of sending a packet (which is
probably the same for both granularities), the bandwidth requirement
of the page approach is 1/10 that of the tuple-level approach.
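
The arithmetic behind this comparison can be reproduced with a few
lines of Python. The function names and the sample values of n, m and
c are ours; the tuple and page sizes are those assumed in the text.

```python
def tuple_level_bytes(n: int, m: int, tuple_bytes: int = 100, c: int = 0) -> int:
    """Bytes crossing the arbitration network when each packet holds two tuples."""
    return n * m * (2 * tuple_bytes + c)


def page_level_bytes(n: int, m: int, tuple_bytes: int = 100,
                     page_bytes: int = 1000, c: int = 0) -> int:
    """Bytes crossing the arbitration network when each packet holds two pages."""
    per_page = page_bytes // tuple_bytes      # tuples that fit on one page
    pages_a = -(-n // per_page)               # ceiling division
    pages_b = -(-m // per_page)
    return pages_a * pages_b * (2 * page_bytes + c)


n, m = 1000, 500                              # tuples in A and in B
print(tuple_level_bytes(n, m))                # n*m*200
print(page_level_bytes(n, m))                 # n*m*20
print(tuple_level_bytes(n, m, c=50), page_level_bytes(n, m, c=50))
```

With c = 0 the ratio is exactly 10; a non-zero per-packet overhead c
widens the gap further, since the tuple-level approach sends 100
times as many packets.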

While increasing the page size to 10,000 bytes will obviously
decrease the arbitration network bandwidth requirements by another
order of magnitude, such an increase may have an adverse effect on
query execution time because it may reduce the degree of concurrency
which is possible. If the number of processors available for query
execution is approximately equal to n * m, tuple-level granularity is
optimal. We feel that this is unlikely, as typically the value of
n * m will be in the millions. Therefore for typical queries (unless
there are millions of processors), tuple-level granularity places an
unnecessary burden on the arbitration network without an apparent
increase in performance. By sending pages of relations to the
processors, a similar degree of concurrency can be achieved while
minimizing network traffic.

4. Preliminary Architecture for a Data-flow Database Machine

While the architecture of the MIT machine could be used as the basis
of a data-flow database machine, we have identified several
properties which for a database machine will unnecessarily limit its
functionality and increase its complexity. The MIT machine [7] is
designed to permit the simultaneous execution of the instructions
from only one program (or one query). This clearly is very
restrictive for a multiuser database management system environment.

Furthermore, we feel that for a database machine the same level of
performance can be achieved with an entirely different design for the
arbitration and distribution networks. These networks are responsible
for instruction initiation and data distribution. The design of the
data distribution network is relatively straightforward. Its function
is to take a result packet produced by a processor and store it in
those instruction cells which are specified in the packet header. The
arbitration network, on the other hand, is very complex. It must
continuously monitor all instruction cells and provide a mechanism
for initiating several enabled instructions simultaneously by routing
the contents of each enabled instruction to a free processor for
execution. We feel that for data-flow database machines these two
networks are too general purpose and consequently excessively
expensive.

In our approach the instruction memory and the arbitration and
distribution networks are replaced with a small number of relatively
low-performance processors. Each processor will be responsible for
controlling the execution of a few (perhaps only one) relational
algebra operations. Thus control of the execution of a query is
distributed among a set of processors. When an instruction controller
(IC) is given a relational algebra operation to control it is also
given an initial allocation of processors (called instruction
processors, or IPs) for executing the instruction. If a typical query
contains five operations, then fifty ICs can maintain a
multiprogramming level of at least ten in the database machine.

Our approach appears to be viable for two reasons: program size
(number of instructions) and execution time of a typical instruction.
One frequently mentioned application [7] for data-flow machines is
large scientific programs (e.g. weather programs). These programs
generally consist of thousands of instructions each of which takes
only a few microseconds (or less) to execute. Even if the instruction
operates on operands of type vector, multiple processors can be used
to work on individual elements and hence instruction execution time
will still be in the microsecond range. For these applications a
large instruction memory is required to hold the entire program.
Since each instruction cell has one input to the arbitration network,
the size of the arbitration network is proportional to that of the
instruction memory. The arbitration and distribution networks must
also be extremely fast. For example, if 100 one microsecond
(execution time of a typical instruction) processing elements are to
be kept busy, the arbitration network must be capable of routing
$10^8$ packets/second.
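
The routing rate in this example follows from one assumption that we
make explicit here: each processing element consumes one instruction
packet per instruction it executes.

```latex
\text{routing rate}
  = \frac{100 \ \text{processing elements}}{10^{-6} \ \text{seconds per instruction}}
  = 10^{8} \ \text{packets/second}.
```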

Relational algebra queries, on the other hand, are composed of
relatively few instructions (typically 1-10 operations) each of which
takes a relatively long time to execute (in the millisecond to second
range). Also packets originating from one IC are sent to a fixed
subset of instruction processors, as, for example, are the inner
relation pages in the join. This permits us to replace the
instruction memory and the two networks with a set of processors
without any loss of performance or functionality.


4.1. Hardware Organization and General Operation

In this section we present one possible design for a data-flow
database machine. Our purpose in studying this architecture is to
enable us to learn more about problems associated with data-flow
database machines. This ring-based organization is of course limited
by the communication medium bandwidth. However, it will later be
shown that bandwidth requirements placed on the ring for a fairly
large configuration are not unreasonable. The organization contains
six major components:

1) The master controller (MC).

2) A set of instruction controllers (IC).

3) A communications ring (inner ring) which connects the master
controller with the instruction controllers.

4) A mass storage system with a multiport disk cache.

5) A set of instruction processors (IP).

6) A communications ring (outer ring) which connects the instruction
processors with the instruction controllers.

The MC serves a number of functions. The first is to handle
communications with the host processor. When a user's query (in the
form of a query tree) is received by the MC it is placed in a queue
of queries awaiting execution. When system resources (ICs and IPs)
become available, the MC removes the next query from the queue,
checks it for concurrency conflicts with other executing queries, and
then distributes a subset of the instructions from the query to a set
of instruction controllers. The other functions of the MC are to
control utilization of the disk cache among the ICs and to control IP
allocation.

Each IC is responsible for controlling one or more instructions.
Controlling an instruction involves first acquiring a set of IPs from
the MC and then distributing instruction packets (see Section 4.2) to
the allocated IPs. Thus the ICs compete with each other for the
processors in the IP pool. The MC is responsible for arbitration of
the requests in a manner which maximizes system performance by
ensuring that processors are distributed across all nodes in a query
tree.

Figure 4.1

A Data-Flow Database Machine Configuration

Each IC has a local memory for pages of source relations which
will be used as operands in the instruction packets it
distributes to the IPs. When the local memory of an IC fills, the
IC will write the least desirable pages to the multiport disk
cache. One possible approach for controlling usage of the disk
cache is to divide it among the ICs according to the number of
IPs each is controlling. When an IC fills its segment of the
disk cache, pages will be swapped out to disk. Thus, the IC
local memory, the disk cache, and the mass storage devices form a
three-level storage hierarchy.
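
The proportional division of the disk cache suggested above can
be illustrated with a short sketch. It is not part of the design:
the function name partition_disk_cache, the IC names, and the
largest-remainder rule used to hand out frames left over by
integer rounding are all assumptions made for the example.

```python
def partition_disk_cache(total_frames, ips_per_ic):
    """Split disk cache page frames among ICs in proportion to IP counts.

    ips_per_ic maps a (hypothetical) IC name to the number of IPs it
    currently controls. Frames left over by integer rounding go to the
    ICs with the largest remainders; that rounding rule is an assumption.
    """
    total_ips = sum(ips_per_ic.values())
    if total_ips == 0:
        return {ic: 0 for ic in ips_per_ic}
    shares = {}
    remainders = []
    for ic, n_ips in ips_per_ic.items():
        exact = total_frames * n_ips
        shares[ic] = exact // total_ips
        remainders.append((exact % total_ips, ic))
    leftover = total_frames - sum(shares.values())
    # Largest remainder first; ties are broken by IC name for determinism.
    for _, ic in sorted(remainders, key=lambda r: (-r[0], r[1]))[:leftover]:
        shares[ic] += 1
    return shares
```

With 100 frames and two ICs controlling 3 and 1 IPs, the sketch
assigns 75 and 25 frames respectively.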

IPs are responsible for executing instruction packets which are
placed on the outer ring by the ICs. When an IP receives an
instruction packet addressed to it, it performs the operation
specified in the packet and then produces an output packet. The
IP then places the output packet on the outer ring and sends it
to the IC which is responsible for controlling the subsequent
operation in the query tree. Thus, the IPs and the outer ring
form a distributed distribution network for result packets.

The inner ring, as has been discussed above, is used exclusively
for distribution of instructions and other control messages by
the MC. Since the messages required for such activities are
small and limited in number, a bandwidth of 1-2 million bits per
second (Mbps) should be sufficient.

The outer ring, on the other hand, is used for distribution of
instructions and result packets by the ICs and IPs. Figure 4.2
represents the bandwidth requirements of DIRECT Ml with
page-level granularity for the test data described in Section
3.2. The bandwidth for each of the different processor levels
was obtained by dividing the total number of bytes transferred by
the execution time of the benchmark containing ten queries.
Thus, the bandwidth values represent average values and not peak
load values. (The bandwidth requirement for 7C1 processors
represents a scheduling anomaly which is also reflected in Figure
3.1.)
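
The averaging just described amounts to a single division. The
sketch below shows the arithmetic only; the function name and the
figures in the usage note are made up for illustration and are
not taken from the benchmark.

```python
def average_bandwidth_mbps(total_bytes, execution_seconds):
    """Average ring bandwidth in millions of bits per second.

    total_bytes is the number of bytes transferred over the whole
    benchmark and execution_seconds is its total execution time.
    """
    if execution_seconds <= 0:
        raise ValueError("execution time must be positive")
    return total_bytes * 8 / execution_seconds / 1_000_000
```

For example, 125,000,000 bytes moved during a 10 second run gives
an average of 100 Mbps. Because the figure is an average over the
whole run, bursts within the run may exceed it.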

There are several possible technologies for the ring
organization which we intend to investigate. One possible
technology is that which is used in the Distributed Loop Computer
Network [14]. This network employs a technique known as
shift-register insertion and can handle the transmission of
variable length messages. A ring bandwidth of up to 40 Mbps can
be obtained in this fashion. Some alternatives are loops
constructed using either fiber optic technology or broadband
coaxial technology. Fiber optics can support bandwidths of 400
Mbps [15] and should be commercially available in the next 5-10
years. Broadband coaxial technology is claimed to be capable of
100 Mbps.

4.2. Instruction Control and Execution

When an instruction is assigned to an IC it can be in one of two
states. If the instruction's operands are source relations in
the database, then the instruction is ready to be executed. In
this case the MC will also send to the IC a page table describing
each operand. Otherwise, if the instruction is not enabled, the
IC will first create a page table for each operand of the
instruction and then wait for pages of the source operand(s) to
arrive from IPs being controlled by another IC. As pages (which
may not be full) arrive, they are compressed to form full pages
[3] and then stored in the IC's local memory or its segment of
the disk cache.
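
The compression of partially filled pages can be pictured with
the following sketch. It is a simplification under stated
assumptions: page capacity is counted in tuples rather than
bytes, a Python list stands in for the IC's local memory or disk
cache segment, and the class and method names are hypothetical.

```python
class PageCompactor:
    """Merge incoming, possibly partial pages into full pages."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity   # tuples per full page (assumed unit)
        self.partial = []          # tuples waiting to fill the next page
        self.full_pages = []       # stand-in for local memory / disk cache

    def receive(self, page):
        """Accept one arriving page, given as a list of tuples."""
        for tup in page:
            self.partial.append(tup)
            if len(self.partial) == self.capacity:
                self.full_pages.append(self.partial)
                self.partial = []

    def finish(self):
        """Store the final, possibly partial page once the producer is done."""
        if self.partial:
            self.full_pages.append(self.partial)
            self.partial = []
        return self.full_pages
```

With a capacity of three tuples, arriving pages of two, two, and
one tuples are stored as one full page followed by one page
holding the remaining two tuples.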

When an IC is ready to initiate the execution of an instruction
(i.e. at least one page of each operand is present), the IC
first sends a control packet to the MC which requests an initial
allocation of IPs and disk cache page frames. If the requested
allocation cannot be fully satisfied, the MC will respond with a
list of the IPs and page frames which are currently available.
When another instruction has terminated, the MC will send the
remaining requested resources to the IC.
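
The request and deferred-grant exchange described above might be
modelled as follows. This is only a sketch: it tracks IPs and
omits disk cache page frames, it serves outstanding requests in
first-come order (the text does not specify the MC's arbitration
policy), and every name in it is hypothetical.

```python
from collections import deque


class MasterControllerSketch:
    """Toy model of IP allocation by the MC (page frames omitted)."""

    def __init__(self, ip_ids):
        self.free_ips = deque(ip_ids)
        self.pending = deque()   # [ic, number of IPs still owed], FIFO (assumed)

    def request(self, ic, wanted):
        """Grant the IPs that are free now and remember any shortfall."""
        n = min(wanted, len(self.free_ips))
        granted = [self.free_ips.popleft() for _ in range(n)]
        if n < wanted:
            self.pending.append([ic, wanted - n])
        return granted

    def release(self, ip_ids):
        """Return IPs when an instruction terminates.

        Returns a dict mapping each waiting IC to the IPs it now receives.
        """
        self.free_ips.extend(ip_ids)
        grants = {}
        while self.pending and self.free_ips:
            entry = self.pending[0]
            n = min(entry[1], len(self.free_ips))
            newly = [self.free_ips.popleft() for _ in range(n)]
            grants.setdefault(entry[0], []).extend(newly)
            entry[1] -= n
            if entry[1] == 0:
                self.pending.popleft()
        return grants
```

An IC that asks for three IPs when only one is free receives that
one immediately and the other two as they are released.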

The packet format of instruction packets sent by an IC to one of
its IPs is shown in Figure 4.3. The destination of the packet is
controlled by the IPid field. It is important to note that since
packets are not fixed length, it is possible for the IC to send
varying size data pages as source operands. In this way maximal
concurrency can be achieved while the bandwidth requirements of
the communications medium are minimized.

Upon receipt the IP applies the operation code to the data pages
contained in the packet. Tuples of the result relation are
placed by the IP in an internal buffer. The IP informs the
controlling IC that it is done by sending it a control packet
(Figure 4.4). The IC can respond by either sending additional
packets or by releasing the IP. When the IP's internal buffer
fills or when the flush-when-done flag of the instruction packet
is on, the IP sends the contents of its buffer in a result packet
(Figure 4.5) to the destination IC specified in the instruction
packet. The controlling IC will turn the flush-when-done flag on
when it expects the outgoing packet to be the last one which will
be sent to the IP.
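
The buffering rule above (send a result packet when the buffer
fills or when the flush-when-done flag is on) can be sketched as
below. The sketch assumes the result tuples have already been
computed, measures buffer capacity in tuples, and records result
packets in a list instead of placing them on the outer ring; all
names are hypothetical.

```python
class ResultBufferSketch:
    """Toy model of an IP's internal result buffer."""

    def __init__(self, capacity):
        self.capacity = capacity   # tuples the buffer can hold (assumed unit)
        self.tuples = []
        self.sent = []             # (destination IC, tuples) result packets

    def _flush(self, dest_ic):
        if self.tuples:
            self.sent.append((dest_ic, self.tuples))
            self.tuples = []

    def execute(self, result_tuples, dest_ic, flush_when_done):
        """Buffer the results of one instruction packet.

        Returns "done" as a stand-in for the control packet sent back
        to the controlling IC.
        """
        for tup in result_tuples:
            self.tuples.append(tup)
            if len(self.tuples) == self.capacity:
                self._flush(dest_ic)        # buffer full: send a result packet
        if flush_when_done:
            self._flush(dest_ic)            # last packet: send what remains
        return "done"
```

Note that without the flag a partially filled buffer is retained,
so results from successive instruction packets are batched into
fewer result packets.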

When an IP first receives an instruction packet for a join
operation, it sets up an "inner-relation control" (IRC) vector
with one entry for each page of the inner relation. (Initially
this vector will have only one entry, but the vector will grow as
execution of the instruction progresses.) After the IP has joined
the first page of the outer relation with the first page of the
inner relation (the two operands in the packet), the IP will send
a "done" control packet to the controlling IC. Included in this
packet will be a request for the second page of the inner
relation. The IC responds to this request by broadcasting the
requested page to all IPs which are executing the join. (An IP
can determine if a broadcast packet is meant for it by examining
the Query ID field of the packet.) Subsequent requests for the
same page which are received by the IC "soon" afterwards can be
ignored.
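
One way an IC could decide that a repeated request arrived "soon"
after a broadcast is to keep the time of the last broadcast of
each page. The text does not define "soon", so the fixed time
window, the explicit clock argument, and the names used in this
sketch are all assumptions.

```python
class BroadcastFilterSketch:
    """Suppress repeated broadcasts of the same inner-relation page."""

    def __init__(self, window):
        self.window = window         # assumed meaning of "soon", in time units
        self.last_broadcast = {}     # page number -> time of last broadcast

    def should_broadcast(self, page_no, now):
        """Return True if a request for page_no should trigger a broadcast."""
        last = self.last_broadcast.get(page_no)
        if last is not None and now - last < self.window:
            return False             # broadcast recently: ignore the request
        self.last_broadcast[page_no] = now
        return True
```

A request arriving after the window has elapsed is honored again,
which is how an IP that missed the earlier broadcast eventually
obtains the page.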

Each IP which receives the broadcast packet can be in one of
several states. If the IP has already sent or is about to send a
request for the same page to the IC, then the IP can proceed to
join the new page of the inner relation with its current page of
the outer and update its IRC vector appropriately. If the IP
does not have room in its local memory for the broadcast page, it
will ignore the packet. However, the following scenario may
occur. Because its local buffer is full, an IP ignores page i of
the inner relation. When broadcast page i+1 is received (before
it or page i has been solicited by the IP), the IP will read page
i+1 and use it as an operand page. This situation can continue
until a packet is received which indicates that this is the last
page of the inner relation. At this point each IP will examine
its IRC vector and then proceed to request those pages which it
missed. When the IP has joined the current page of the outer
relation with all the pages of the inner relation, it will first
zero its IRC vector and then signal the IC that it is ready for
another page of the outer relation which has not yet been
distributed to an IP. In this way message traffic on the outer
ring is minimized and yet correct operation of the join can be
guaranteed.
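
The IRC bookkeeping for a single page of the outer relation can
be sketched as follows. The sketch makes several simplifying
assumptions: inner pages are numbered from 0, every inner page
(including the first) is presented through the same method, the
join itself is replaced by recording the page number, and the
class and method names are hypothetical.

```python
class JoinIPSketch:
    """IRC vector handling at one IP for one outer-relation page."""

    def __init__(self):
        self.irc = [False]   # irc[i] is True once inner page i has been joined
        self.joined = []     # order in which inner pages were actually joined

    def on_broadcast(self, page_no, has_room=True, is_last=False):
        """Handle a broadcast inner page; return the pages still to request.

        The returned list stays empty until the last inner page has
        been announced, at which point it names every missed page.
        """
        while len(self.irc) <= page_no:       # the vector grows with the relation
            self.irc.append(False)
        if has_room and not self.irc[page_no]:
            self.joined.append(page_no)       # stand-in for the real join
            self.irc[page_no] = True
        if is_last:
            return [i for i, done in enumerate(self.irc) if not done]
        return []

    def outer_page_complete(self):
        """True when the outer page has met every known inner page."""
        return all(self.irc)

    def next_outer_page(self):
        """Zero the IRC vector before requesting another outer page."""
        self.irc = [False] * len(self.irc)
```

An IP that had no room for page 1 but joined pages 0 and 2 learns
from the last-page indication that only page 1 must be requested,
so each inner page is joined exactly once per outer page even
though pages arrive out of order.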

5. Conclusions and Future Research

In this paper we have presented the use of data-flow machine
techniques for the processing of relational algebra queries. The
performance of two multiprocessor join algorithms was analyzed
and it was shown that the nested-loops algorithm is generally
superior to the sort-merge algorithm. We have also discussed
alternative operand granularities for data-flow database machines
and have demonstrated that page-level granularity is the best
choice for optimum system performance. A preliminary design for
a data-flow database machine which utilizes page-level
granularity and supports distributed control of instruction
execution has been described.

There are several features of our proposed design with which we
are not completely satisfied and which warrant further
investigation. In particular, we feel that it should be possible
to route some of the data pages which are produced by IPs
directly from one IP to another without first sending the page to
an IC. Thus, instruction execution control would be distributed
further. If such an approach could be successfully designed and
implemented, message traffic on the outer ring could be further
reduced. There appears, however, to be a tradeoff between
decreased message traffic and increased IP complexity which needs
further examination before the correct approach can be chosen.

Another area of research which we intend to pursue is
concurrency control mechanisms for data-flow database machines.
In our current design, the MC is responsible for all concurrency
control. We intend to investigate a distributed mechanism in
which the ICs and not the MC would be responsible for concurrency
control. One can view a data-flow database machine (such as
described in Section 4.0) as a "local" distributed database
system in which the ICs correspond to distributed centers of
query execution and control. As a first step we intend to
examine the mechanisms which have been proposed for concurrency
control in distributed database systems to see if they are
applicable. We intend to evaluate the performance of each of
these algorithms to determine how each performs in a "local"
environment. Then, based on our findings, we will either adopt
one of the existing algorithms or attempt to develop a new
algorithm which takes advantage of the "local" nature of the ICs.

While the ring architecture we have proposed seems to satisfy
the organizational requirements for data-flow database machines,
the requirement for a high bandwidth communications medium may
not be realistic for a large number of IPs. The other MIMD
database machines which have been proposed have depended on
processor-memory interconnections ranging in complexity from
O(n^2) [1,2] to O(n log n) [15-17]. We feel that the O(n) nature
of the ring organization is the best if high (100 Mbit) bandwidth
rings become available. However, we intend to investigate other
processor interconnection strategies for data-flow database
machines which satisfy the requirements specified in Section 4.0
yet which can be constructed from existing technologies.